
Conversation

@scottgerring scottgerring (Member) commented Aug 12, 2025

Fixes #3081, building on the work started by @AaronRM 🤝

Changes

A new retry module added to opentelemetry-sdk

Models the kinds of retry an operation may request (retryable / non-retryable / throttled), and provides a retry_with_backoff helper that wraps a retryable operation and retries it. The helper relies on experimental_async_runtime for its runtime abstraction to provide the actual pauses, and takes a closure that classifies the error, so the caller can tell the retry mechanism whether a retry is warranted.
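
For illustration, here is a minimal, synchronous sketch of the idea. The variant names and the max_retries/jitter_ms fields come from the merged code; everything else (the remaining field names, the exact signature, and the blocking sleep standing in for the async runtime's delay) is an assumption rather than the real API:

```rust
use std::time::Duration;

/// How a failed attempt asks to be handled.
pub enum RetryErrorType {
    /// Transient failure: retry with exponential backoff.
    Retryable,
    /// Permanent failure: give up and surface the error.
    NonRetryable,
    /// The server asked us to wait this long before retrying.
    Throttled(Duration),
}

/// Caller-tunable retry behaviour.
pub struct RetryPolicy {
    pub max_retries: u32,
    pub initial_delay_ms: u64, // assumed field name
    pub max_delay_ms: u64,     // assumed field name
    pub jitter_ms: u64,
}

/// Simplified, blocking stand-in for retry_with_backoff: run `operation`,
/// classify any error, and either pause-and-retry or give up.
pub fn retry_with_backoff<T, E>(
    policy: &RetryPolicy,
    sleep: impl Fn(Duration), // the real helper delegates this to the async runtime
    classify: impl Fn(&E) -> RetryErrorType,
    mut operation: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 0;
    let mut delay_ms = policy.initial_delay_ms;
    loop {
        match operation() {
            Ok(value) => return Ok(value),
            Err(err) => match classify(&err) {
                RetryErrorType::Retryable if attempt < policy.max_retries => {
                    attempt += 1;
                    // Exponential backoff with a little jitter, capped at max_delay_ms.
                    let jitter = delay_ms % (policy.jitter_ms + 1); // deterministic stand-in for randomness
                    sleep(Duration::from_millis(delay_ms.min(policy.max_delay_ms) + jitter));
                    delay_ms = delay_ms.saturating_mul(2);
                }
                RetryErrorType::Throttled(server_delay) if attempt < policy.max_retries => {
                    attempt += 1;
                    // A server-provided delay overrides the computed backoff.
                    sleep(server_delay);
                }
                // Non-retryable, or the retry budget is exhausted.
                _ => return Err(err),
            },
        }
    }
}
```

Each concrete exporter wraps its export call in this helper, passing an OTLP-specific classifier (see the next section).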

A new retry_classification module added to opentelemetry-otlp

This takes the error responses we get back over OTLP and maps them onto the retry model. Because this is OTLP-specific, it belongs here rather than alongside the retry code.
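
As a rough illustration of the HTTP side, this is the kind of mapping involved (the function name and import path are assumptions; the status handling follows the OTLP/HTTP spec, which treats 429, 502, 503 and 504 as retryable and lets a Retry-After header supply a throttling hint):

```rust
use std::time::Duration;

// Assumed import path for the enum added by the new SDK retry module.
use opentelemetry_sdk::retry::RetryErrorType;

/// Map an OTLP/HTTP response status (plus an optional Retry-After value,
/// in seconds) onto the retry model.
fn classify_http_response(status: u16, retry_after_secs: Option<u64>) -> RetryErrorType {
    match status {
        // Retryable statuses per the OTLP/HTTP spec.
        429 | 502 | 503 | 504 => match retry_after_secs {
            // The server told us how long to back off; honour it.
            Some(secs) => RetryErrorType::Throttled(Duration::from_secs(secs)),
            None => RetryErrorType::Retryable,
        },
        // Everything else (including other 4xx/5xx) is not retried.
        _ => RetryErrorType::NonRetryable,
    }
}
```

The gRPC side does the analogous mapping from tonic status codes (and any server-provided throttling hints) onto the same enum.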

Retry binding

... happens in each of the concrete exporters to tie everything together.

Also ...

  • Extended the exporter builders to allow the user to customise the default retry policy
  • Added new feature flags, experimental-http-retry and experimental-grpc-retry, which pull in the experimental-async-runtime dependency and wire everything up. This way we can ship retry support now without having to stabilise the experimental-async-runtime feature (see the usage sketch after this list).
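
A hypothetical usage sketch follows. The feature-flag names come from this PR; the with_retry_policy builder method, the RetryPolicy import path, and the field names other than max_retries and jitter_ms are illustrative assumptions rather than the exact merged API:

```rust
// Cargo.toml (illustrative):
//   opentelemetry-otlp = { version = "*", features = ["grpc-tonic", "experimental-grpc-retry"] }

use opentelemetry_otlp::SpanExporter;
use opentelemetry_sdk::retry::RetryPolicy; // assumed path

fn build_exporter() -> Result<SpanExporter, Box<dyn std::error::Error>> {
    let exporter = SpanExporter::builder()
        .with_tonic()
        // Hypothetical builder hook for overriding the default retry policy.
        .with_retry_policy(RetryPolicy {
            max_retries: 3,
            initial_delay_ms: 100, // assumed field name
            max_delay_ms: 1_600,   // assumed field name
            jitter_ms: 100,
        })
        .build()?;
    Ok(exporter)
}
```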

Open Questions

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 74.12077% with 780 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.8%. Comparing base (ad88615) to head (9ba3a06).
⚠️ Report is 224 commits behind head on main.

Files with missing lines | Patch % | Lines
opentelemetry-otlp/src/exporter/http/mod.rs 67.0% 162 Missing ⚠️
opentelemetry-sdk/src/metrics/data/mod.rs 13.4% 154 Missing ⚠️
opentelemetry-proto/src/transform/metrics.rs 11.1% 64 Missing ⚠️
...-sdk/src/metrics/internal/exponential_histogram.rs 65.1% 52 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/metrics.rs 0.0% 50 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/trace.rs 0.0% 48 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/logs.rs 0.0% 46 Missing ⚠️
opentelemetry-otlp/src/exporter/tonic/mod.rs 73.4% 42 Missing ⚠️
opentelemetry-sdk/src/metrics/instrument.rs 88.9% 29 Missing ⚠️
opentelemetry-sdk/src/logs/logger_provider.rs 92.0% 12 Missing ⚠️
... and 28 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #3126     +/-   ##
=======================================
+ Coverage   79.6%   80.8%   +1.2%     
=======================================
  Files        124     128      +4     
  Lines      23174   23090     -84     
=======================================
+ Hits       18456   18676    +220     
+ Misses      4718    4414    -304     

@scottgerring scottgerring force-pushed the feat/retry-logic branch 4 times, most recently from 3847b26 to fb141db on August 12, 2025 10:22
@scottgerring scottgerring changed the title from "[not ready!] feat: support backoff/retry" to "feat: support backoff/retry in OTLP" on Aug 12, 2025
@scottgerring scottgerring marked this pull request as ready for review August 19, 2025 14:32
@scottgerring scottgerring requested a review from a team as a code owner August 19, 2025 14:32
@lalitb lalitb requested a review from Copilot September 1, 2025 18:50

@Copilot Copilot AI (Contributor) left a comment

Pull Request Overview

This PR implements retry logic with exponential backoff and jitter for OTLP exporters to handle transient failures gracefully, addressing issue #3081. The implementation supports both HTTP and gRPC protocols with protocol-specific error classification and server-provided throttling hints.

  • Adds a new retry module to opentelemetry-sdk with configurable retry policies and exponential backoff
  • Implements protocol-specific error classification in opentelemetry-otlp for HTTP and gRPC responses
  • Integrates retry functionality into all OTLP exporters (traces, metrics, logs) for both HTTP and gRPC transports

Reviewed Changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

File Description
opentelemetry-sdk/src/retry.rs Core retry module with exponential backoff, jitter, and error classification
opentelemetry-otlp/src/retry_classification.rs Protocol-specific error classification for HTTP and gRPC responses
opentelemetry-otlp/src/exporter/tonic/*.rs gRPC exporter integration with retry functionality
opentelemetry-otlp/src/exporter/http/*.rs HTTP exporter integration with retry functionality
opentelemetry-otlp/Cargo.toml Feature flags and dependencies for retry support

@scottgerring scottgerring force-pushed the feat/retry-logic branch 2 times, most recently from af933a2 to f1636a0 on September 2, 2025 09:11

@bantonsson bantonsson (Contributor) left a comment

I think the HTTP exporters look good now. Love all those red lines.

@bantonsson bantonsson (Contributor) left a comment

👍🏼 for the HTTP code. I can't see a clear way to reuse more of the Tonic code.

@lalitb lalitb self-assigned this Sep 16, 2025
@lalitb lalitb (Member) commented Sep 16, 2025

Sorry for the delay. I would like to review this during the week, so I'm assigning it to myself.

@lalitb lalitb merged commit 3b2f751 into open-telemetry:main Oct 15, 2025
26 of 27 checks passed
@scottgerring scottgerring deleted the feat/retry-logic branch October 15, 2025 15:30
RetryErrorType::Retryable if attempt < policy.max_retries => {
attempt += 1;
// Use exponential backoff with jitter
otel_warn!(name: "OtlpRetry", message = format!("Retrying operation {:?} due to retryable error: {:?}", operation_name, err));
Member

This should be info level, as we are not yet giving up and losing data.


match error_type {
RetryErrorType::NonRetryable => {
otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err));
Member

Let's stick with structured logging as much as possible instead of stringifying, i.e. operation and error should be their own fields.
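
A rough sketch of that suggestion, assuming otel_warn! accepts additional key = value fields alongside name (as it does elsewhere in the workspace); the event name here is also just illustrative:

```rust
// Sketch only: operation and error as their own fields rather than baked
// into a single formatted message string.
otel_warn!(
    name: "OtlpRetry.NonRetryableError",
    operation = operation_name,
    error = format!("{:?}", err)
);
```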


match error_type {
RetryErrorType::NonRetryable => {
otel_warn!(name: "OtlpRetry", message = format!("Operation {:?} failed with non-retryable error: {:?}", operation_name, err));
Member

name:"OtlpRetry" - this is not the best usage of event names - as its reused many times in this file itself, each time with different event. We need to do dedicated event name for each distinct type of logs, and ensure the schema (fields/etc) is same for a given event name.

RetryErrorType::Throttled(server_delay) if attempt < policy.max_retries => {
attempt += 1;
// Use server-specified delay (overrides exponential backoff)
otel_warn!(name: "OtlpRetry", message = format!("Retrying operation {:?} after server-specified throttling delay: {:?}", operation_name, server_delay));
Member

Please downgrade to info level.

pub jitter_ms: u64,
}

/// A runtime stub for when experimental_async_runtime is not enabled.
Member

Not sure I follow the use case for this. This PR already adds a runtime implementation for the dedicated-thread case in the SDK crate. Should the OTLP exporter be aware of anything more than letting the SDK runtime implementation do its own delay, either via an asynchronous delay or a blocking sleep?

endpoint: self.collector_endpoint.to_string(),
});

// Select runtime based on HTTP client feature - if we're using
Member

This might need to be made more robust. The OTLP exporter having to pick the runtime feels flaky; it won't know about all possible runtime implementations.


- Update `opentelemetry-proto` and `opentelemetry-http` dependency version to 0.31.0
- Add HTTP compression support with `gzip-http` and `zstd-http` feature flags
- Add retry with exponential backoff and throttling support for HTTP and gRPC exporters
Member

Let's add some more details here so that a user reading the changelog knows how to use this feature (given that it is experimental and opt-in).

@cijothomas cijothomas (Member) left a comment

I know the PR is merged, but I left some comments. We can follow up separately to address them.

Development

Successfully merging this pull request may close these issues.

OTLP Stabilization: Throttling & Retry
